Abstract

Gaussian belief propagation (GaBP) is an iterative algorithm for computing the mean of a 
multivariate Gaussian distribution, or equivalently, the minimum of a multivariate positive 
definite quadratic function. Sufficient conditions, such as walk-summability, that guarantee 
the convergence and correctness of GaBP are known, but GaBP may fail to converge to the 
correct solution given an arbitrary positive definite quadratic function. As was observed 
by Malioutov et al. (2006), the GaBP algorithm fails to converge if the computation trees 
produced by the algorithm are not positive definite. In this work, we will show that the 
failure modes of the GaBP algorithm can be understood via graph covers, and we prove that 
a parameterized generalization of the min-sum algorithm can be used to ensure that the 
computation trees remain positive definite whenever the input matrix is positive definite. 
We demonstrate that the resulting algorithm is closely related to other iterative schemes 
for quadratic minimization such as the Gauss-Seidel and Jacobi algorithms. Finally, we 
observe, empirically, that there always exists a choice of parameters such that the above 
generalization of the GaBP algorithm converges. 

Keywords: belief propagation, Gaussian graphical models, graph covers 



1. Introduction 

In this work, we study the properties of reweighted message-passing algorithms with respect 
to the quadratic minimization problem. Let $\Gamma \in \mathbb{R}^{n \times n}$ be a symmetric positive definite
matrix and $h \in \mathbb{R}^n$. The quadratic minimization problem is to find the $x \in \mathbb{R}^n$ that minimizes
$f(x) = \frac{1}{2} x^T \Gamma x - h^T x$. Minimizing a positive definite quadratic function is equivalent to
computing the mean of a multivariate Gaussian distribution with a positive definite covariance
matrix or, equivalently, solving the positive definite linear system $\Gamma x = h$ for the vector $x$.
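The equivalence between minimizing the quadratic and solving the linear system can be checked numerically. The following is a minimal sketch; the matrix `Gamma` and vector `h` are illustrative values chosen here, not taken from the paper.

```python
import numpy as np

# Hypothetical 3x3 example; Gamma and h are illustrative values only.
Gamma = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
h = np.array([1.0, 2.0, 3.0])

def f(x):
    """Quadratic objective f(x) = 1/2 x^T Gamma x - h^T x."""
    return 0.5 * x @ Gamma @ x - h @ x

# The gradient Gamma x - h vanishes exactly at the solution of Gamma x = h.
x_star = np.linalg.solve(Gamma, h)
gradient = Gamma @ x_star - h

# Because Gamma is positive definite, every perturbation of x_star increases f.
rng = np.random.default_rng(0)
perturbed = [f(x_star + 0.1 * rng.standard_normal(3)) for _ in range(100)]
```

Here `gradient` is zero up to rounding, and every entry of `perturbed` exceeds `f(x_star)`.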

Gaussian belief propagation (GaBP) is an iterative message-passing scheme that can
be used to estimate the mean of a Gaussian distribution as well as individual variances. 
Because of their distributed nature and their ability to provide estimates of the individual 
means and variances for each variable, using the GaBP and min-sum algorithms to solve 
linear systems has been an active area of research. 
In previous work, several authors have provided sufficient conditions for the convergence
of GaBP. Weiss and Freeman (2001a) demonstrated that GaBP converges in the case that 
the covariance matrix is diagonally dominant. Malioutov et al. (2006) proved that the GaBP 
algorithm converges when the covariance matrix is walk-summable. Moallemi and Van Roy 
(2009, 2010) showed that scaled diagonal dominance was a sufficient condition for conver- 
gence and also characterized the rate of convergence via a computation tree analysis. The 
latter two sufficient conditions, walk-summability and scaled diagonal dominance, are known
to be equivalent (Malioutov, 2008; Ruozzi et al., 2009). 

While the above conditions are sufficient for the convergence of the GaBP algorithm,
they are not necessary: there are examples of positive definite matrices that are not 
walk-summable for which the min-sum algorithm still converges to the correct solution 
(Malioutov et al., 2006). A critical component of these examples is that the computation 
trees remain positive definite throughout the algorithm. Such behavior is guaranteed if the 
original matrix is scaled diagonally dominant, but arbitrary positive definite matrices can 
produce computation trees that are not positive definite (Malioutov et al., 2006). If this 
occurs, the standard GaBP algorithm fails to produce the correct solution. 

One proposed solution to the above difficulties is to precondition the covariance matrix 
in order to force it to be scaled diagonally dominant. Diagonal loading was proposed as 
one such useful preconditioner in Johnson et al. (2009). The key insight of diagonal loading 
is that scaled diagonal dominance can be achieved by sufficiently weighting the diagonal 
elements of the covariance matrix. The diagonally loaded matrix can then be used as an 
input to a GaBP subroutine. The solution produced by GaBP is then used in a feedback 
loop to produce a new matrix, and the process is repeated until a desired level of accuracy 
is achieved. Unfortunately, this approach results in an algorithm that, unlike GaBP, is
not decentralized and distributed. Moreover, choosing the appropriate amount of diagonal
loading and feedback to achieve quick convergence remains an open question.

Recent work has studied provably convergent variants of the min-sum algorithm. The 
result has been the development of many different "convergent and correct" message- 
passing algorithms: MPLP (Globerson and Jaakkola, 2007), max-sum diffusion (Werner, 
2007), norm-product belief propagation (Hazan and Shashua, 2010), and tree-reweighted 
belief propagation (Wainwright et al., 2005). Each of these algorithms can be viewed as 
a coordinate ascent/descent scheme for an appropriate lower/upper bound. A general
overview of these techniques and their relationship to bound maximization can be found in 
Sontag and Jaakkola (2009) and Meltzer et al. (2009). These algorithms guarantee conver- 
gence under an appropriate message-passing schedule, and they also guarantee correctness 
if a unique assignment can be extracted upon convergence. Such algorithms are plausible
candidates in the search for convergent GaBP-style message-passing algorithms.

The primary contributions of this work are twofold. First, we demonstrate that graph 
covers can be used to provide a new, combinatorial characterization of walk-summability. 
This characterization allows us to conclude that "convergent and correct message-passing" 
schemes based on dual optimization techniques that guarantee the correctness of locally 
decodable beliefs cannot converge to the correct minimizing assignment outside of walk- 
summable models. 

Second, we investigate the behavior of reweighted message-passing algorithms for the 
quadratic minimization problem. The motivation for this study comes from the observation 
that belief propagation style algorithms typically do not explore all nodes in the factor graph
with the same frequency (Frey et al., 2001). In many application areas such uneven counting 
is undesirable and typically results in incorrect answers, but if we can use reweighting to 
overestimate the diagonal entries of the computation tree relative to the off diagonal entries, 
then we may be able to force the computation trees to be positive definite at each iteration 
of the algorithm. Although similar in spirit to diagonal loading, our approach preserves the 
distributed message-passing structure of the algorithm. We will show that there exists a 
choice of parameters for the reweighted algorithms that guarantees monotone convergence 
of the variance estimates on all positive definite models, even those for which the GaBP 
algorithm fails to converge. We empirically observe that there exists a choice of parameters 
that also guarantees the convergence of the mean estimates. 

In addition, we show that our graph cover analysis extends to other iterative algorithms 
for the quadratic minimization problem, and that similar ideas can be used to reason about 
the min-sum algorithm for general convex minimization. 

The outline of this paper is as follows: In Section 2 we review the min-sum algorithm, 
its reweighted generalizations, and the quadratic minimization problem. In Section 3 we 
discuss the relationship between pairwise message-passing algorithms and graph covers, 
and we show how to use graph covers to characterize walk-summability. In Section 4, we
examine the convergence of the means and the variances under the reweighted algorithm for 
the quadratic minimization problem, we examine the relationship between the reweighted 
algorithm and the Gauss-Seidel and Jacobi methods, and we compare the performance of 
the reweighted algorithm to the standard min-sum algorithm. Finally, in Section 5, we 
summarize the results and discuss extensions of this work to general convex functions as 
well as open problems. Detailed proofs of the two main theorems can be found in
Appendices A and B.

2. Preliminaries 

In this section, we review the min-sum algorithm and a reweighted variant over pairwise 
factor graphs. Of particular importance for later proofs will be the computation trees 
generated by each of these algorithms. We also review the quadratic minimization problem, 
and derive the closed form message updates for this problem. 

2.1. The Min-Sum Algorithm 

The min-sum algorithm attempts to compute the minimizing assignment of an objective
function $f : \prod_i \mathcal{X}_i \to \mathbb{R}$ that, given a graph $G = (V, E)$, can be factorized as a sum of
self-potentials and edge potentials as follows:
\[ f(x_1, \ldots, x_n) = \sum_{i \in V} \phi_i(x_i) + \sum_{(i,j) \in E} \psi_{ij}(x_i, x_j). \]
We assume that this minimization problem is well-defined: $f$ is bounded from below and
there exists an $x \in \prod_i \mathcal{X}_i$ that minimizes $f$.

To each factorization, we associate a bipartite graph known as the factor graph. In 
general, the factor graph consists of a node for each of the variables $x_1, \ldots, x_n$, a node for
each of the $\psi_{ij}$, and for all $(i,j) \in E$, an edge joining the node corresponding to $x_i$ to the
node corresponding to $\psi_{ij}$. Because the $\psi_{ij}$ each depend on exactly two variables, we often
omit the factor nodes from the factor graph construction and replace them with a single 
(a) Factor graph. (b) The graph G.

Figure 1: The factor graph corresponding to $f(x_1, x_2, x_3) = \phi_1 + \phi_2 + \phi_3 + \psi_{12} + \psi_{23} + \psi_{13}$
and the graph $G$. The functions $\phi_1$, $\phi_2$, and $\phi_3$ each depend on only one variable, and
are typically omitted from the factor graph representation for clarity.



edge. This reduces the factor graph to the graph G. See Figure 1 for an example of this 
construction. 

We can write the min-sum algorithm as a local message-passing algorithm over the graph 
G. During the execution of the min-sum algorithm, messages are passed back and forth 
between adjacent nodes of the graph. On the $t$-th iteration of the algorithm, messages are
passed along each edge of the factor graph as
\[ m^t_{i \to j}(x_j) = \kappa + \min_{x_i} \Bigl[ \psi_{ij}(x_i, x_j) + \phi_i(x_i) + \sum_{k \in \partial i \setminus j} m^{t-1}_{k \to i}(x_i) \Bigr], \]
where $\partial i$ denotes the set of neighbors of variable node $x_i$ in the factor graph and $\partial i \setminus j$ is
abusive notation for the set-theoretic difference $\partial i \setminus \{j\}$. When the factor graph is a tree,
these updates are guaranteed to converge, but understanding when these updates converge 
to the correct solution for an arbitrary graph is a central question underlying the study of 
the min-sum algorithm. 
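The message update can be illustrated on a small discrete model. The sketch below runs synchronous min-sum on a hypothetical three-node chain with binary variables; the potentials, the variable names, and the choice to normalize each message so that its minimum is zero are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical chain 0 - 1 - 2 with binary variables; potentials are made-up values.
phi = [np.array([0.0, 1.0]), np.array([0.5, 0.0]), np.array([1.0, 0.0])]
psi = {(0, 1): np.array([[0.0, 2.0], [2.0, 0.0]]),
       (1, 2): np.array([[0.0, 1.0], [1.0, 0.0]])}
neighbors = {0: [1], 1: [0, 2], 2: [1]}

def edge_potential(i, j):
    """Return psi_ij as a table indexed [x_i, x_j]."""
    return psi[(i, j)] if (i, j) in psi else psi[(j, i)].T

# messages[(i, j)][x_j] is the message from i to j, initialized to zero.
messages = {(i, j): np.zeros(2) for i in neighbors for j in neighbors[i]}

for _ in range(3):  # the chain has diameter 2, so a few rounds suffice
    new = {}
    for (i, j) in messages:
        incoming = sum((messages[(k, i)] for k in neighbors[i] if k != j), np.zeros(2))
        # Minimize over x_i (axis 0) for each value of x_j.
        table = edge_potential(i, j) + (phi[i] + incoming)[:, None]
        m = table.min(axis=0)
        new[(i, j)] = m - m.min()  # the free constant kappa is used to normalize
    messages = new

# Beliefs: self-potential plus all incoming messages; decode by a per-node argmin.
beliefs = [phi[i] + sum(messages[(k, i)] for k in neighbors[i]) for i in neighbors]
estimate = [int(np.argmin(b)) for b in beliefs]
```

Because the chain is a tree and the minimizer of this toy objective is unique, `estimate` agrees with a brute-force search over all eight assignments.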

Each message update has an arbitrary normalization factor $\kappa$. Because $\kappa$ is not a
function of any of the variables, it only affects the value of the minimum and not where the 
minimum is located. As such, we are free to choose it however we like for each message and 
each time step. In practice, these constants are used to avoid numerical issues that may 
arise during the execution of the algorithm. 

We will think of the messages as a vector of functions indexed by the edge over which 
the message is passed. Any vector of real-valued messages is a valid choice for the vector of 
initial messages $m^0$, and the choice of initial messages can greatly affect the behavior of the
algorithm. A typical assumption, which we will use in this work, is that the initial messages
are chosen such that $m^0_{i \to j} = 0$ for all $i$ and $j$.

We can use the messages in order to construct an estimate of the min-marginals of $f$.
Given any vector of messages, $m^t$, we can construct a set of beliefs that are intended to
approximate the min-marginals of $f$ as
\begin{align*}
\tau^t_i(x_i) &= \kappa + \phi_i(x_i) + \sum_{k \in \partial i} m^t_{k \to i}(x_i), \\
\tau^t_{ij}(x_i, x_j) &= \kappa + \psi_{ij}(x_i, x_j) + \tau^t_j(x_j) - m^t_{i \to j}(x_j) + \tau^t_i(x_i) - m^t_{j \to i}(x_i).
\end{align*}
Additionally, we can approximate the optimal assignment by computing an estimate of
the argmin,
\[ x^t_i \in \arg\min_{x_i} \tau^t_i(x_i). \]
If the beliefs corresponded to the true min-marginals of $f$ (i.e., $\tau^t_i(x_i) = \min_{x' : x'_i = x_i} f(x')$),
then for any $y_i \in \arg\min_{x_i} \tau^t_i(x_i)$ there exists a vector $x^*$ such that $x^*_i = y_i$ and $x^*$
minimizes the function $f$. If $|\arg\min_{x_i} \tau^t_i(x_i)| = 1$ for all $i$, then we can take $x^* = y$, but,
if the objective function has more than one optimal solution, then we may not be able to
construct such an $x^*$ so easily.

Definition 1 A vector, $\tau = (\{\tau_i\}, \{\tau_{ij}\})$, of beliefs is locally decodable to $x^*$ if
$\tau_i(x^*_i) < \tau_i(x_i)$ for all $i$ and all $x_i \neq x^*_i$. Equivalently, for each $i$, $\tau_i$ is uniquely minimized at $x^*_i$.

If the algorithm converges to a vector of beliefs that are locally decodable to x*, then 
we hope that the vector x* is a global minimum of the objective function. This is indeed 
the case when the factor graph contains no cycles (Wainwright et al., 2004) but need not 
be the case for arbitrary graphical models. 

2.1.1. Computation Trees 

An important tool in the analysis of the min-sum algorithm is the notion of a computation 
tree. Intuitively, the computation tree is an unrolled version of the original graph that 
captures the evolution of the messages passed by the min-sum algorithm needed to compute 
the belief at time t at a particular node of the factor graph. Computation trees describe 
the evolution of the beliefs over time, which, in some cases, can help us prove correctness 
and/or convergence of the message-passing updates. For example, the convergence of the 
min-sum algorithm on graphs containing a single cycle can be demonstrated by analyzing 
the computation trees produced by the min-sum algorithm at each time step (Weiss, 2000). 

The depth t computation tree rooted at node i contains all of the length t non-backtracking 
walks in the factor graph starting at node i. A walk is non-backtracking if it does not go 
back and forth successively between two vertices. For any node $i$ in the factor graph, the
computation tree at time $t$ rooted at $i$, denoted by $T_i(t)$, is defined recursively as follows:
$T_i(0)$ is just the node $i$, the root of the tree. The tree $T_i(t)$ at time $t > 0$ is generated from
$T_i(t-1)$ by adding to each leaf of $T_i(t-1)$ a copy of each of its neighbors in $G$ (and the
corresponding edge), except for the neighbor that is already present in $T_i(t-1)$. Each node
of $T_i(t)$ is a copy of a node in $G$, and the potentials on the nodes in $T_i(t)$, which operate on
a subset of the variables in $T_i(t)$, are copies of the potentials of the corresponding nodes in
$G$. The construction of a computation tree for the graph in Figure 1 is pictured in Figure
2. Note that each variable node in $T_i(t)$ represents a distinct copy of some variable $x_j$ in
the original graph.
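The recursive construction can be sketched in a few lines. The function names and the nested-tuple representation below are choices made for this illustration; the example graph is the triangle on three variables described in the caption of Figure 1.

```python
def computation_tree(neighbors, root, depth):
    """Return the tree as nested tuples (label, [children]).

    Each leaf is expanded with copies of its neighbors in G,
    except the neighbor it was reached from (its parent).
    """
    def expand(node, parent, remaining):
        if remaining == 0:
            return (node, [])
        children = [expand(w, node, remaining - 1)
                    for w in neighbors[node] if w != parent]
        return (node, children)
    return expand(root, None, depth)

def count_nodes(tree):
    """Count the nodes of a nested-tuple tree."""
    return 1 + sum(count_nodes(child) for child in tree[1])

# Triangle graph on variables 1, 2, 3.
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
tree = computation_tree(triangle, root=1, depth=2)
```

For the triangle, the depth-2 tree rooted at node 1 has five nodes: the root, its two neighbors, and one further copy below each of them, one per non-backtracking walk.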

Given any initialization of the messages, $T_i(t)$ captures the information available to node
$i$ at time $t$. At time $t = 0$, node $i$ has received only the initial messages from its neighbors,
so $T_i(0)$ consists only of $i$. At time $t = 1$, $i$ receives the round one messages from all of its
neighbors, so i's neighbors are added to the tree. These round one messages depend only 
on the initial messages, so the tree terminates at this point. By construction, we have the 
following lemma. 
Figure 2: The computation tree at time $t = 2$ rooted at the variable node $x_1$ of the graph
in Figure 1. The self-potentials corresponding to each variable node are given by 
the subscript of the variable. 



Lemma 2 The belief at node i produced by the min-sum algorithm at time t corresponds to 
the exact min-marginal at the root of $T_i(t)$ whose boundary messages are given by the initial
messages. 

Proof See, for example, Tatikonda and Jordan (2002); Weiss and Freeman (2001b). ■ 

Computation trees provide us with a dynamic view of the min-sum algorithm. After a 
finite number of time steps, we hope that the beliefs on the computation trees stop changing 
and that the message vector converges to a fixed point of the message update equations 
(in practice, when the beliefs change by less than some small amount, we say that the 
algorithm has converged). For any real-valued objective function $f$ (i.e., $|f(x)| < \infty$ for all
x), there always exists a fixed point of the message update equations (see Theorem 2 of 
Wainwright et al. (2004)). 

2.2. Reweighted Message-Passing Algorithms 

Because the min-sum algorithm is not guaranteed to converge and, even if it does, is not
guaranteed to compute the correct minimizing assignment, research has focused on the design
of alternative message-passing schemes that do not suffer from these drawbacks. Efforts
to produce provably convergent message-passing schemes have resulted in the reweighted
message-passing algorithm described in Algorithm 1. Notice that if we set $c_{ij} = 1$ for all $i$
and $j$, then we obtain the standard min-sum algorithm. In Wainwright et al. (2005), the
$c_{ij}$ are chosen in a specific way in order to guarantee correctness of the algorithm (which
they call TRMP in this special case). In this work, we will focus on choices of these weights
that will guarantee convergence of the algorithm for the quadratic minimization problem. 
These choices will, surprisingly, not coincide with those of the TRMP algorithm. In fact, 
the choice of weights that guarantees correctness of the TRMP algorithm must necessarily 
cause the algorithm to either not converge or converge to the incorrect solution whenever 
the given matrix is not walk-summable. 
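Algorithm 1 Synchronous Reweighted Message-Passing Algorithm
1: Initialize the messages to some finite vector.
2: For iteration $t = 1, 2, \ldots$ update the messages as follows:
\[ m^t_{i \to j}(x_j) = \kappa + \min_{x_i} \Bigl[ \frac{\psi_{ij}(x_i, x_j)}{c_{ij}} + \phi_i(x_i) + (c_{ij} - 1)\, m^{t-1}_{j \to i}(x_i) + \sum_{k \in \partial i \setminus j} c_{ki}\, m^{t-1}_{k \to i}(x_i) \Bigr] \]
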

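The beliefs for this algorithm are defined analogously to those for the standard min-sum
algorithm:
\begin{align*}
\tau^t_i(x_i) &= \kappa + \phi_i(x_i) + \sum_{j \in \partial i} c_{ji}\, m^t_{j \to i}(x_i), \\
\tau^t_{ij}(x_i, x_j) &= \kappa + \frac{\psi_{ij}(x_i, x_j)}{c_{ij}} + \tau^t_j(x_j) - m^t_{i \to j}(x_j) + \tau^t_i(x_i) - m^t_{j \to i}(x_i).
\end{align*}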

The vector of messages at any fixed point of the message update equations has two 
important properties. First, the beliefs corresponding to these messages provide an alterna- 
tive factorization of the objective function /. Second, the beliefs correspond to approximate 
marginals. 

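Lemma 3 For any $m^t \in \mathbb{R}^{2|E|}$,
\[ f(x_1, \ldots, x_{|V|}) = \kappa + \sum_{i \in V} \tau^t_i(x_i) + \sum_{(i,j) \in E} c_{ij} \bigl[ \tau^t_{ij}(x_i, x_j) - \tau^t_i(x_i) - \tau^t_j(x_j) \bigr]. \]

Lemma 4 If $\tau$ is a set of beliefs corresponding to a fixed point of the message updates in
Algorithm 1, then
\[ \min_{x_j} \tau_{ij}(x_i, x_j) = \kappa + \tau_i(x_i) \]
for all $(i,j) \in E$ and all $x_i$.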

The proof of these two lemmas is a straightforward exercise in applying the defini- 
tions. Similar results for the special case of the max-product algorithm can be found in 
Wainwright et al. (2004). 
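As a brief sketch of why the reparameterization in Lemma 3 holds, assume the weights are symmetric ($c_{ij} = c_{ji}$) and absorb all additive constants into $\kappa$. Substituting the belief definitions gives the two identities below; summing the first over all edges and adding the second, the weighted message terms cancel and only the original potentials remain.

```latex
\begin{align*}
c_{ij}\bigl[\tau_{ij}(x_i,x_j) - \tau_i(x_i) - \tau_j(x_j)\bigr]
  &= \kappa + \psi_{ij}(x_i,x_j) - c_{ij}\, m_{i\to j}(x_j) - c_{ij}\, m_{j\to i}(x_i), \\
\sum_{i \in V} \tau_i(x_i)
  &= \kappa + \sum_{i \in V} \phi_i(x_i)
     + \sum_{(i,j) \in E} \bigl[c_{ij}\, m_{i\to j}(x_j) + c_{ij}\, m_{j\to i}(x_i)\bigr].
\end{align*}
```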

2.2.1. Computation Trees 

The computation trees produced by Algorithm 1 are different from their predecessors. 
Again, the computation tree captures the messages that would need to be passed in order
to compute $\tau^t_i(x_i)$. However, the messages that are passed in the new algorithm are
multiplied by a non-zero constant. As a result, the potential at a node u in the computa- 
tion tree corresponds to some potential in the original graph multiplied by a constant that 
depends on all of the nodes above u in the computation tree. We summarize the changes 
as follows: 
Figure 3: Construction of the computation tree rooted at node $x_1$ at time $t = 2$ produced
by Algorithm 1 for the factor graph in Figure 1. Self-potentials are adjacent to
the variable node to which they correspond. One can check that setting $c_{ij} = 1$
for all $(i,j) \in E$ reduces the above computation tree to that of Figure 2.



1. The message passed from i to j may now depend on the message from j to i at the 
previous time step. As such, we now form the time t + 1 computation tree from the 
time t computation tree by taking any leaf u, which is a copy of node v in the factor 
graph, of the time t computation tree, creating a new node for every $w \in \partial v$, and
connecting u to these new nodes. As a result, the new computation tree rooted at 
node u of depth t contains at least all of the non-backtracking walks of length t in the 
factor graph starting from u and, at most, all walks of length t in the factor graph 
starting at u. 

2. The messages are weighted by the elements of c. This changes the potentials at the 
nodes in the computation tree. For example, suppose the computation tree was rooted 
at variable node $i$ and that $\tau_i$ depends on the message from $j$ to $i$. Because $m_{j \to i}$ is
multiplied by $c_{ij}$ in $\tau_i$, every potential along this branch of the computation tree is
multiplied by $c_{ij}$. To make this concrete, we can associate a weight to every edge of
the computation tree that corresponds to the constant that multiplies the message
passed across that edge. To compute the new potential at a variable node $i$ in the
computation tree, we now need to multiply the corresponding potential $\phi_i$ by each of
the weights corresponding to the edges that appear along the path from $i$ to the root
of the computation tree. An analogous process can be used to compute the potentials
on each of the edges. The computation tree produced by Algorithm 1 at time $t = 2$ for
the factor graph in Figure 1 is pictured in Figure 3. Compare this with the computation
tree produced by the standard min-sum algorithm in Figure 2.
If we make these adjustments, then the belief, $\tau^t_i(x_i)$, at node $i$ at time $t$ is given by the
min-marginal at the root of $T_i(t)$. In this way, the beliefs correspond to marginals at the
root of these computation trees.



2.3. Quadratic Minimization 

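We now address the quadratic minimization problem in the context of the reweighted
min-sum algorithm. Recall that given a matrix $\Gamma$ the quadratic minimization problem is to
find the vector $x$ that minimizes $f(x) = \frac{1}{2} x^T \Gamma x - h^T x$. Without loss of generality, we can
assume that the matrix $\Gamma$ is symmetric, as the quadratic function $\frac{1}{2} x^T \Gamma x - h^T x$ is equivalent
to $\frac{1}{2} x^T \bigl[ \frac{1}{2} (\Gamma + \Gamma^T) \bigr] x - h^T x$ for any $\Gamma \in \mathbb{R}^{n \times n}$:
\begin{align*}
f(x) &= \frac{1}{2} x^T \Bigl[ \frac{1}{2} (\Gamma + \Gamma^T) + \frac{1}{2} (\Gamma - \Gamma^T) \Bigr] x - h^T x \\
     &= \frac{1}{2} x^T \Bigl[ \frac{1}{2} (\Gamma + \Gamma^T) \Bigr] x + \frac{1}{2} x^T \Bigl[ \frac{1}{2} (\Gamma - \Gamma^T) \Bigr] x - h^T x \\
     &= \frac{1}{2} x^T \Bigl[ \frac{1}{2} (\Gamma + \Gamma^T) \Bigr] x - h^T x.
\end{align*}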

Every quadratic function admits a pairwise factorization
\[ \frac{1}{2} x^T \Gamma x - h^T x = \sum_i \Bigl[ \frac{1}{2} \Gamma_{ii} x_i^2 - h_i x_i \Bigr] + \sum_{i > j} \Gamma_{ij} x_i x_j, \]
where $\Gamma \in \mathbb{R}^{n \times n}$ is a symmetric matrix. We note that we will abusively write min in the
reweighted update equations even though the appropriate notion of minimization for the
real numbers is inf.

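We can explicitly compute the minimization required by the reweighted min-sum algorithm
at each time step: the synchronous message update $m^t_{i \to j}(x_j)$ can be parameterized
as a quadratic function of the form $\frac{1}{2} a^t_{i \to j} x_j^2 + b^t_{i \to j} x_j$. If we define
\[ A^t_{i \setminus j} \triangleq \Gamma_{ii} + \sum_{k \in \partial i} c_{ki}\, a^t_{k \to i} - a^t_{j \to i} \]
and
\[ B^t_{i \setminus j} \triangleq h_i - \sum_{k \in \partial i} c_{ki}\, b^t_{k \to i} + b^t_{j \to i}, \]
then the updates at time $t$ are given by
\begin{align*}
a^t_{i \to j} &:= -\frac{(\Gamma_{ij} / c_{ij})^2}{A^{t-1}_{i \setminus j}}, \\
b^t_{i \to j} &:= \frac{B^{t-1}_{i \setminus j}}{A^{t-1}_{i \setminus j}} \cdot \frac{\Gamma_{ij}}{c_{ij}}.
\end{align*}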
These updates are only valid when $A^{t-1}_{i \setminus j} > 0$. If this is not the case, then the minimization
given in Algorithm 1 is not bounded from below, and we set $a^t_{i \to j} = -\infty$. For the initial
messages, we set $a^0_{i \to j} = b^0_{i \to j} = 0$. A similar analysis holds for the asynchronous updates.

Suppose that the beliefs generated from a fixed point of Algorithm 1 are locally decodable 
to $x^*$. One can show that the gradient of $f$ at $x^*$ is always equal to zero. If the gradient
of $f$ at $x^*$ is zero and $\Gamma$ is positive definite, then $x^*$ must be a global minimum of $f$. In
other words, the min-sum algorithm always computes the correct minimizing assignment if
it converges to locally decodable beliefs. This result was proven for the GaBP algorithm in 
Weiss and Freeman (2001a) and the tree-reweighted algorithm in Wainwright et al. (2003b). 

Theorem 5 If Algorithm 1 converges to a collection of beliefs, $\tau$, that are locally decodable
to $x^*$ for a quadratic function $f$, then $x^*$ is a local minimum of $f$.

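Proof For completeness, we sketch the proof. By Lemmas 3 and 4, we have that
\[ \min_{x_j} \tau_{ij}(x_i, x_j) = \kappa + \tau_i(x_i) \]
for all $(i,j) \in E$ and
\[ f(x_1, \ldots, x_{|V|}) = \kappa + \sum_{i \in V} \tau_i(x_i) + \sum_{(i,j) \in E} c_{ij} \bigl[ \tau_{ij}(x_i, x_j) - \tau_i(x_i) - \tau_j(x_j) \bigr]. \]
If $\tau$ is locally decodable to $x^*$, then for each $i \in V$, $\tau_i(x_i)$ must be a positive definite
quadratic function that is minimized at $x^*_i$. Applying Lemma 4, we have that for each
$(i,j) \in E$, $\tau_{ij}$ is also a positive definite quadratic function and $\tau_{ij}$ is minimized at $(x^*_i, x^*_j)$.
For each $i \in V$,
\[ \frac{\partial}{\partial x_i} f(x_1, \ldots, x_{|V|}) = \frac{\partial}{\partial x_i} \tau_i(x_i) + \sum_{j \in \partial i} c_{ij} \Bigl[ \frac{\partial}{\partial x_i} \tau_{ij}(x_i, x_j) - \frac{\partial}{\partial x_i} \tau_i(x_i) \Bigr]. \]
By the above arguments, for each $i \in V$, $\frac{\partial}{\partial x_i} \tau_i(x_i) = 0$ at $x^*_i$ and $\frac{\partial}{\partial x_i} \tau_{ij}(x_i, x_j) = 0$ at $(x^*_i, x^*_j)$ for all
$j \in \partial i$. As a result, we must have $\nabla f(x^*_1, \ldots, x^*_{|V|}) = 0$. If $\Gamma$ is positive semidefinite, then
$f$ is convex, and $x^*$ must be a global minimum of $f$. ■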



As a consequence of Theorem 5, even if $\Gamma$ is not positive definite, if some fixed point of the
reweighted algorithm is locally decodable to a vector $x^*$, then $x^*$ solves the system $\Gamma x = h$.

Recall that several authors have provided sufficient conditions for the convergence of 
GaBP: Weiss and Freeman (2001a) demonstrated that GaBP converges in the case that the 
covariance matrix is diagonally dominant, Malioutov et al. (2006) proved that the GaBP 
algorithm converges when the covariance matrix is walk-summable, and Moallemi and Van Roy
(2009, 2010) showed that scaled diagonal dominance was a sufficient condition for conver- 
gence and also characterized the rate of convergence via a computation tree analysis. These 
properties of matrices will be important in the sequel. 

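Definition 6 $\Gamma \in \mathbb{R}^{n \times n}$ is scaled diagonally dominant if $\exists w > 0 \in \mathbb{R}^n$ such that
$|\Gamma_{ii}| w_i > \sum_{j \neq i} |\Gamma_{ij}| w_j$ for all $i$.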
(a) A graph G (b) A 2-cover of G

Figure 4: An example of a graph cover. Nodes in the cover are labeled by the node that 
they are a copy of in G. 

Definition 7 $\Gamma \in \mathbb{R}^{n \times n}$ is walk-summable if the spectral radius $\varrho(|I - D^{-1/2} \Gamma D^{-1/2}|) < 1$.
Here, $D^{-1/2}$ is the diagonal matrix such that $D^{-1/2}_{ii} = 1/\sqrt{\Gamma_{ii}}$ and $|A|$ denotes the matrix
obtained from the matrix $A$ by taking the absolute value of each entry of $A$.
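Both matrix properties are easy to test numerically. The sketch below computes the spectral radius from Definition 7 and checks scaled diagonal dominance in its usual form, $|\Gamma_{ii}| w_i > \sum_{j \neq i} |\Gamma_{ij}| w_j$, for a candidate weight vector `w` supplied by the caller. The function names and the `triangle` test family are hypothetical.

```python
import numpy as np

def walk_summable_radius(Gamma):
    """Spectral radius of |I - D^{-1/2} Gamma D^{-1/2}| (requires a positive diagonal)."""
    d = 1.0 / np.sqrt(np.diag(Gamma))
    R = np.abs(np.eye(len(Gamma)) - d[:, None] * Gamma * d[None, :])
    return max(abs(np.linalg.eigvals(R)))

def is_scaled_diagonally_dominant(Gamma, w):
    """Check |Gamma_ii| w_i > sum_{j != i} |Gamma_ij| w_j for the given positive w."""
    absG = np.abs(Gamma)
    diag = np.diag(absG)
    off = absG @ w - diag * w
    return bool(np.all(w > 0) and np.all(diag * w > off))

def triangle(r):
    """Hypothetical test family: unit diagonal, all off-diagonal entries equal to r."""
    return np.full((3, 3), r) + (1 - r) * np.eye(3)

radius_small = walk_summable_radius(triangle(0.4))
radius_large = walk_summable_radius(triangle(0.6))
```

For this family the radius equals $2r$, so `triangle(0.4)` is walk-summable while `triangle(0.6)` is not. Note that a failed check for one particular `w` does not by itself rule out scaled diagonal dominance.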

3. Graph Covers 

In this section, we will explore graph covers and their relationship to iterative message- 
passing algorithms for the quadratic minimization problem. Before addressing the quadratic 
minimization problem specifically, we will first make a few observations about general pair- 
wise graphical models. The greatest strength of the above message-passing algorithms, their 
reliance on only local information, can also be a weakness: local message-passing algorithms 
are incapable of distinguishing two graphs that have the same local structure. To make this 
precise, we will need the notion of graph covers. 

Definition 8 A graph $H$ covers a graph $G$ if there exists a graph homomorphism $\pi : H \to G$
such that $\pi$ is an isomorphism on neighborhoods (i.e., for all vertices $i \in H$, $\partial i$ is mapped
bijectively onto $\partial \pi(i)$). If $\pi(i) = j$, then we say that $i \in H$ is a copy of $j \in G$. Further, $H$
is a $k$-cover of $G$ if every vertex of $G$ has exactly $k$ copies in $H$.

Graph covers, in the context of graphical models, were originally studied in relation to 
local message-passing algorithms for coding problems (Vontobel and Koetter, 2005). Graph 
covers may be connected (i.e., there is a path between every pair of vertices) or disconnected. 
However, when a graph cover is disconnected, all of the connected components of the cover 
must themselves be covers of the original graph. For a simple example of a connected graph 
cover, see Figure 4. 
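To make Definition 8 concrete, the following sketch builds a 2-cover of a simple graph and checks the neighborhood condition. The construction (two copies of each vertex, with a chosen set of "twisted" edges joining opposite copies) and all names are illustrative assumptions; the triangle example is not necessarily the graph shown in Figure 4.

```python
def two_cover(edges, twisted):
    """Build a 2-cover: each vertex v gets copies (v, 0) and (v, 1).

    Edges listed in `twisted` (same orientation as in `edges`) connect
    opposite copies; all other edges connect equal copies.
    """
    cover = []
    for (u, v) in edges:
        flip = 1 if (u, v) in twisted else 0
        for s in (0, 1):
            cover.append(((u, s), (v, s ^ flip)))
    return cover

def neighborhoods(edges):
    """Map each vertex to the list of its neighbors."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, []).append(v)
        nbrs.setdefault(v, []).append(u)
    return nbrs

def is_cover(cover_edges, base_edges):
    """Check that pi(v, s) = v maps each neighborhood bijectively onto the base one."""
    base, lifted = neighborhoods(base_edges), neighborhoods(cover_edges)
    return all(sorted(w for (w, _) in lifted[(v, s)]) == sorted(base[v])
               for (v, s) in lifted)

triangle = [(1, 2), (2, 3), (1, 3)]
hexagon = two_cover(triangle, twisted={(1, 2)})
```

Twisting a single edge of the triangle yields a connected 6-cycle, while twisting no edges yields two disjoint triangles; both pass `is_cover`.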

Every finite cover of a connected graph is a $k$-cover for some integer $k$. For every base
graph G, there exists a graph, possibly infinite, which covers all finite, connected covers of 
the base graph. This graph is known as the universal cover. 

To any finite cover, H, of a factor graph G we can associate a collection of potentials 
derived from the base graph; the potential at node $i \in H$ is equal to the potential at node
$\pi(i) \in G$. Together, these potential functions define a new objective function for the factor
graph $H$. In the sequel, we will use superscripts to specify that a particular object is over
the factor graph $H$. For example, we will denote the objective function corresponding to a
factor graph $H$ as $f^H$, and we will write $f^G$ for the objective function $f$.
Local message-passing algorithms such as the reweighted min-sum algorithm are incapable
of distinguishing the two factor graphs $H$ and $G$ given that the initial messages to
and from each node in $H$ are identical to those of the nodes that they cover in $G$: for every node
$i \in G$ the messages received and sent by this node at time $t$ are exactly the same as the
messages sent and received at time $t$ by any copy of $i$ in $H$. As a result, if we use a local
message-passing algorithm to deduce an assignment for i, then the algorithm run on the 
graph H must deduce the same assignment for each copy of i. 

Now, consider an objective function $f$ that factors over the graph $G$. For any finite
cover $H$ of $G$ with covering homomorphism $\pi : H \to G$, we can "lift" any vector of beliefs,
$\tau^G$, from $G$ to $H$ by defining a new vector of beliefs, $\tau^H$, such that:

• For all variable nodes $i \in H$, $\tau^H_i = \tau^G_{\pi(i)}$.

• For all edges $(i,j) \in H$, $\tau^H_{ij} = \tau^G_{\pi(i)\pi(j)}$.

Analogously, we can lift any assignment $x^G$ to an assignment $x^H$ by setting $x^H_i = x^G_{\pi(i)}$.

3.1. Graph Covers and Quadratic Minimization 

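Let $G$ be the pairwise factor graph for the objective function $f^G(x_1, \ldots, x_n) = \frac{1}{2} x^T \Gamma x - h^T x$
whose edges correspond to the nonzero entries of $\Gamma$. Let $H$ be a $k$-cover of $G$ with
corresponding objective function $f^H(x_{11}, \ldots, x_{1k}, \ldots, x_{nk}) = \frac{1}{2} x^T \tilde{\Gamma} x - \tilde{h}^T x$. Without loss of
generality we can assume that $\tilde{\Gamma}$ and $\tilde{h}$ take the following form:
\[
\tilde{\Gamma} = \begin{pmatrix} \Gamma_{11} P_{11} & \cdots & \Gamma_{1n} P_{1n} \\ \vdots & \ddots & \vdots \\ \Gamma_{n1} P_{n1} & \cdots & \Gamma_{nn} P_{nn} \end{pmatrix},
\qquad
\tilde{h} = \begin{pmatrix} h_1 \mathbf{1} \\ \vdots \\ h_n \mathbf{1} \end{pmatrix},
\]
where $P_{ij}$ is a $k \times k$ permutation matrix for all $i \neq j$, $P_{ii}$ is the $k \times k$ identity matrix
for all $i$, and $\mathbf{1} \in \mathbb{R}^k$ is the all-ones vector. If $\tilde{\Gamma}$ is derived from $\Gamma$ in this way, then
we will say that $\tilde{\Gamma}$ covers $\Gamma$.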

For the quadratic minimization problem, factor graphs and their covers share many 
of the same properties. Most notably, we can transform critical points of covers to critical
points of the original problem. Let $H$ and $G$ be as above, and let $\pi$ be the graph
homomorphism from $H$ to $G$. For $x \in \mathbb{R}^{|V_G|}$, define $\mathrm{lift}_H(x) \in \mathbb{R}^{|V_H|}$ such that
\[ \mathrm{lift}_H(x)_i = x_{\pi(i)} \]
for all $i \in H$. Similarly, for each $y \in \mathbb{R}^{|V_H|}$, define $\mathrm{proj}_G(y) \in \mathbb{R}^{|V_G|}$ such that
\[ \mathrm{proj}_G(y)_i = \frac{1}{k} \sum_{j \in H : \pi(j) = i} y_j \]
for all $i \in G$. With these definitions, we have the following lemma.

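Lemma 9 If $\tilde{\Gamma} y = \tilde{h}$ for $y \in \mathbb{R}^{|V_H|}$, then $\Gamma \cdot \mathrm{proj}_G(y) = h$. Conversely, if $\Gamma x = h$ for
$x \in \mathbb{R}^{|V_G|}$, then $\tilde{\Gamma} \cdot \mathrm{lift}_H(x) = \tilde{h}$.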
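Lemma 9 can be checked numerically on a small example. In this sketch the copies of each variable are stored contiguously, the block for $(j, i)$ is taken to be the transpose of the block for $(i, j)$ so that the cover matrix stays symmetric, and the matrix, vector, and function names are all illustrative assumptions.

```python
import numpy as np

def cover_matrix(Gamma, perms):
    """Assemble the kn x kn matrix whose (i, j) block is Gamma[i, j] * P_ij.

    `perms[(i, j)]` (for i < j) is a k x k permutation matrix; missing entries
    default to the identity, and P_ji is taken to be P_ij^T to keep symmetry.
    """
    n = len(Gamma)
    k = next(iter(perms.values())).shape[0] if perms else 1
    blocks = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < j:
                P = perms.get((i, j), np.eye(k))
            elif i > j:
                P = perms.get((j, i), np.eye(k)).T
            else:
                P = np.eye(k)
            blocks[i][j] = Gamma[i, j] * P
    return np.block(blocks), k

def lift(x, k):
    """Copy each coordinate of x onto its k copies."""
    return np.repeat(x, k)

def proj(y, k):
    """Average the k copies of each coordinate."""
    return y.reshape(-1, k).mean(axis=1)

Gamma = np.array([[4.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 2.5]])
h = np.array([1.0, 2.0, 3.0])
swap = np.array([[0.0, 1.0], [1.0, 0.0]])
Gamma_cover, k = cover_matrix(Gamma, {(0, 1): swap})
h_cover = lift(h, k)
```

Lifting the solution of the base system solves the cover system, and projecting the solution of the cover system solves the base system.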
Figure 5: An example of a positive definite matrix, $\Gamma$, which possesses a 2-cover, $\tilde{\Gamma}$, that
has negative eigenvalues.



Notice that these solutions correspond to critical points of the cover and the original 
problem. Similarly, we can transform eigenvectors of covers to either eigenvectors of the 
original problem or the zero vector. 

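Lemma 10 Fix $\lambda \in \mathbb{R}$. If $\tilde{\Gamma} y = \lambda y$, then either $\Gamma \cdot \mathrm{proj}_G(y) = \lambda\, \mathrm{proj}_G(y)$ or $\Gamma \cdot \mathrm{proj}_G(y) = 0$.
Conversely, if $\Gamma x = \lambda x$, then $\tilde{\Gamma} \cdot \mathrm{lift}_H(x) = \lambda\, \mathrm{lift}_H(x)$.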

These lemmas demonstrate that we can average critical points and eigenvectors of covers 
to obtain critical points and eigenvectors (or the zero vector) of the original problem, and 
we can lift critical points and eigenvectors of the original problem in order to obtain critical 
points and eigenvectors of covers. 

Unfortunately, even though the critical points of G and its covers must correspond via 
Lemma 9, the corresponding minimization problems may not have the same solution. The 
example in Figure 5 illustrates that there exist positive definite matrices that are covered 
by matrices that are not positive definite. This observation seems to be problematic for 
the convergence of iterative message-passing schemes. Specifically, the fixed points of the 
reweighted algorithm on the base graph are also fixed points of the reweighted algorithm 
on any graph cover. As such, the reweighted algorithm may not converge to the correct 
minimizing assignment when the matrix corresponding to some cover of G is not positive 
definite. Consequently, we will first consider the special case in which $\Gamma$ and all of its covers
are positive definite. We can exactly characterize the matrices for which this property holds. 

Theorem 11 Let $\Gamma$ be a symmetric matrix with positive diagonal. The following are equivalent:

1. $\Gamma$ is walk-summable.

2. $\Gamma$ is scaled diagonally dominant.

3. All covers of $\Gamma$ are positive definite.

4. All 2-covers of $\Gamma$ are positive definite.

Proof The two non-trivial implications in the proof (4 $\Rightarrow$ 1 and 1 $\Rightarrow$ 2) make use of the
Perron-Frobenius theorem. For the complete details, see Appendix A. ■

This theorem has several important consequences. First, it provides us with a
combinatorial characterization of scaled diagonal dominance and walk-summability. Second,
it provides an intuitive explanation for why these conditions should be sufficient for the 
convergence of local message-passing algorithms. Indeed, walk-summability and scaled 
diagonal dominance were independently shown to be sufficient conditions for the con- 
vergence of the min-sum algorithm for positive definite matrices (Malioutov et al., 2006; 
Moallemi and Van Roy, 2010). Most importantly, we can use the theorem to conclude that 
MPLP, tree-reweighted max-product, and other message-passing algorithms that guarantee 
the correctness of locally decodable beliefs cannot converge to the correct solution when $\Gamma$
is positive definite but not walk-summable. To see this, note that, for these algorithms,
if the beliefs are locally decodable to a vector $x^*$, then $x^*$ must minimize the objective
function. As we saw earlier, any collection of locally decodable beliefs on the base graph
can be lifted to locally decodable beliefs on any graph cover. In other words, the lift of
$x^*$ to each graph cover must be a global minimum on that cover. However, there exists
at least one cover with no global minimum. As a result, these algorithms cannot converge
to locally decodable beliefs. For more details about these types of message-passing
algorithms, we refer the reader to Globerson and Jaakkola (2007), Wainwright et al. (2005), and
Sontag and Jaakkola (2009).
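
The equivalence of conditions 1 and 2 in Theorem 11 can be checked numerically on small examples. The sketch below is only an illustration, not part of the proof: it estimates the spectral radius of $|I - \Gamma|$ by power iteration and then uses the resulting Perron vector as the weight vector in the test for scaled diagonal dominance. The function names are hypothetical, a unit diagonal is assumed, and the test matrix is the 4 x 4 example of Section 4.4 as it can be read from the text.

```python
def example_matrix(p):
    # The 4 x 4 matrix (2) of Section 4.4, as read from the text (assumption).
    return [[1.0, p, -p, -p],
            [p, 1.0, -p, 0.0],
            [-p, -p, 1.0, -p],
            [-p, 0.0, -p, 1.0]]


def perron(gamma, iters=500):
    """Power iteration on |I - gamma|, assuming gamma has a unit diagonal.

    Returns an estimate of the spectral radius and a positive eigenvector.
    """
    n = len(gamma)
    r = [[abs(gamma[i][j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    x = [1.0] * n
    rho = 0.0
    for _ in range(iters):
        y = [sum(r[i][j] * x[j] for j in range(n)) for i in range(n)]
        rho = max(y)  # x is kept scaled so that max(x) == 1
        x = [v / rho for v in y]
    return rho, x


def is_walk_summable(gamma):
    rho, _ = perron(gamma)
    return rho < 1.0


def is_scaled_diagonally_dominant(gamma, w):
    # Checks |gamma_ii| * w_i > sum_{j != i} |gamma_ij| * w_j for every row i.
    n = len(gamma)
    return all(
        abs(gamma[i][i]) * w[i]
        > sum(abs(gamma[i][j]) * w[j] for j in range(n) if j != i)
        for i in range(n)
    )


rho_inside, w_inside = perron(example_matrix(0.39))
rho_outside, w_outside = perron(example_matrix(0.40))
```

For this matrix the spectral radius of $|I - \Gamma|$ equals $|p|(1 + \sqrt{17})/2$, so walk-summability holds for p = 0.39 and fails for p = 0.40, and the Perron vector certifies scaled diagonal dominance in the first case only.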

Contrast this analysis with Theorem 5. In general, the reweighted message-passing
algorithm only guarantees that $x^*$ is a local optimum whenever the objective function is not
positive semidefinite, but there exist simple choices for the reweighting parameters that
guarantee correctness over all covers. As an example, if $c_{ij} \le \frac{1}{\max_{i \in V} |\partial i|}$ for all $(i,j) \in E$,
then one can show that the reweighted algorithm cannot converge to locally decodable
beliefs. The traditional choice of parameters for the TRMP algorithm where each $c_{ij}$
corresponds to an edge appearance probability provides another example. As such, in order
to produce convergent message-passing schemes for the quadratic minimization problem,
we will need to study choices of the parameters that do not guarantee correctness over all
graph covers.

4. Convergence Properties of Reweighted Message-Passing Algorithms 

Recall that the GaBP algorithm can converge to the correct minimizer of the objective function
even if the original matrix is not scaled diagonally dominant. The most significant problem 
when the original matrix is positive definite but not scaled diagonally dominant is that the 
computation trees may eventually possess negative eigenvalues due to the existence of some 
2-cover with at least one non-positive eigenvalue. If this happens, then some of the beliefs 
will not be bounded from below, and the corresponding estimate will be negative infinity. 
This is, of course, the correct answer on some 2-cover of the problem, but it is not the 
correct solution to the minimization problem of interest. 

Our goal in this section is to understand how the choice of the parameters affects the
convergence of the reweighted algorithm.

4.1. Convergence of the Variances

In this section, we will provide conditions on the choice of the parameter vector such that
all of the computation trees produced by the reweighted algorithm remain positive definite
throughout the course of the algorithm.

Positive definiteness of the computation trees corresponds to the convexity of the beliefs,
and the convexity of the belief, $\tau^t$, is determined only by the vector $a^t$. As such, we begin
by studying the sequence $a^0, a^1, \ldots$ where $a^0$ is the zero vector (based on our initialization).
We will consider two different choices for the parameter vector: one in which $c_{ij} > 1$ for all
$i$ and $j$ and one in which $c_{ij} < 0$ for all $i$ and $j$.

4.1.1. Positive Parameters 

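Lemma 12 If $c_{ij} > 1$ for all $i$ and $j$, then for all $t > 0$, $a^t_{i \to j} \le a^{t-1}_{i \to j} \le 0$ for each $i$ and $j$.

Proof This result follows by induction on $t$. First, suppose that $c_{ij} > 1$. If the update is
not valid, then $a^t_{i \to j} = -\infty$ which trivially satisfies the inequality. Otherwise, we have:

$$a^t_{i \to j} = \frac{-\left(\Gamma_{ij}/c_{ij}\right)^2}{\Gamma_{ii} + \sum_{k \in \partial i \setminus j} c_{ki} a^{t-1}_{k \to i} + (c_{ij} - 1) a^{t-1}_{j \to i}} \le \frac{-\left(\Gamma_{ij}/c_{ij}\right)^2}{\Gamma_{ii} + \sum_{k \in \partial i \setminus j} c_{ki} a^{t-2}_{k \to i} + (c_{ij} - 1) a^{t-2}_{j \to i}} = a^{t-1}_{i \to j},$$

where the inequality follows from the observation that $\Gamma_{ii} + \sum_{k \in \partial i \setminus j} c_{ki} a^{t-1}_{k \to i} + (c_{ij} - 1) a^{t-1}_{j \to i} > 0$
and the induction hypothesis. For $t = 1$, the claim holds because $a^0$ is the zero vector and
$a^1_{i \to j} = -(\Gamma_{ij}/c_{ij})^2/\Gamma_{ii} \le 0$. ■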



If we consider only the vector $a^t$, then the algorithm may exhibit a weaker form of
convergence:

Lemma 13 If $c_{ij} > 1$ for all $i$ and $j$ and all of the computation trees are positive definite,
then the sequence $a^0_{i \to j}, a^1_{i \to j}, \ldots$ converges.

Proof Suppose $c_{ij} > 1$. By Lemma 12, the $a^t_{i \to j}$ are monotonically decreasing. Because
all of the computation trees are positive definite, we must have that for each $i$,
$\Gamma_{ii} + \sum_{k \in \partial i \setminus j} c_{ki} a^t_{k \to i} + c_{ji} a^t_{j \to i} > 0$. Since every $a^t_{k \to i}$ is nonpositive by Lemma 12, this gives,
for all $(i,j) \in E$, $a^t_{i \to j} > -\Gamma_{jj}/c_{ij}$, and the
sequence $a^0_{i \to j}, a^1_{i \to j}, \ldots$ is monotonically decreasing and bounded from below. This implies
that the sequence converges. ■



Because the estimates of the variances only depend on the vector $a^t$, if the $a^t_{i \to j}$ converge,
then the estimates of the variances also converge. Therefore, requiring all of the computation
trees to be positive definite is a sufficient condition for convergence of the variances. Note,
however, that the estimates of the means which correspond to the sequence $b^t_{i \to j}$ need not
converge even if all of the computation trees are positive definite (see Figure 6).

Figure 6: A positive definite matrix for which the variances in the min-sum algorithm
converge but the means do not. (Malioutov, 2008)

Our strategy will be to ensure that all of the computation trees are positive definite 
by leveraging the choice of parameters, Cij. Specifically, we want to use these parameters 
to weight the diagonal elements of the computation tree much more than the off-diagonal 
elements in order to force the computation trees to be positive definite. If we can show 
that there is a choice of each Cij = cji that will cause all of the computation trees to be 
positive definite, then Algorithm 1 should behave almost as if the original matrix were 
scaled diagonally dominant. There always exists a choice of the vector c that achieves this. 

Theorem 14 For any symmetric matrix $\Gamma$ with strictly positive diagonal, there exist an $r > 1$ and an
$\epsilon > 0$ such that the eigenvalues of the computation trees are bounded from below by $\epsilon$ when
generated by Algorithm 1 with $c_{ij} = r$ for all $i$ and $j$.

Proof The proof of this theorem exploits the Geršgorin disc theorem in order to show that
there exists a choice of $r$ such that each computation tree is scaled diagonally dominant.
The complete proof can be found in Appendix B. ■



4.1.2. Negative Parameters 

For the case in which $c_{ij} < 0$ for all $i$ and $j$, we also have that the computation trees are
always positive definite when the initial messages are uniformly equal to zero as characterized
by the following lemmas.

Lemma 15 If $c_{ij} < 0$ for all $i$ and $j$, then for all $t > 0$, $a^t_{i \to j} \le 0$.

Proof This result follows by induction on $t$. First, suppose that $c_{ij} < 0$. If the update is
not valid, then $a^t_{i \to j} = -\infty$ which trivially satisfies the inequality. Otherwise, we have

$$a^t_{i \to j} = \frac{-\left(\Gamma_{ij}/c_{ij}\right)^2}{\Gamma_{ii} + \sum_{k \in \partial i \setminus j} c_{ki} a^{t-1}_{k \to i} + (c_{ij} - 1) a^{t-1}_{j \to i}} \le 0,$$

where the inequality follows from the induction hypothesis. ■

Algorithm 2 Asynchronous Reweighted Message-Passing Algorithm
1: Initialize the messages to some finite vector.
2: Choose some ordering of the variables such that each variable is updated infinitely often,
   and perform the following update for each variable $j$ in order:
3: for each $i \in \partial j$ do
4:   Update the message from $i$ to $j$:

$$m_{i \to j}(x_j) := \kappa + \min_{x_i} \Big[ \frac{\psi_{ij}(x_i, x_j)}{c_{ij}} + (c_{ij} - 1)\, m_{j \to i}(x_i) + \phi_i(x_i) + \sum_{k \in \partial i \setminus j} c_{ki}\, m_{k \to i}(x_i) \Big]$$

5: end for



Lemma 16 For any symmetric matrix $\Gamma$ with strictly positive diagonal, if $c_{ij} < 0$ for all $i$
and $j$, then all of the computation trees are positive definite.

Proof The computation trees are all positive definite if and only if $\Gamma_{ii} + \sum_{k \in \partial i} c_{ki} a^t_{k \to i} > 0$
for all $t$. By Lemma 15, $a^t_{k \to i} \le 0$ for all $t$, and as a result, $\Gamma_{ii} + \sum_{k \in \partial i} c_{ki} a^t_{k \to i} \ge \Gamma_{ii} > 0$ for
all $t$. ■



As in the case when $c_{ij} > 1$ for all $(i,j) \in E$, the eigenvalues on each computation
tree are again bounded away from zero, but the $a^t_{i \to j}$ no longer form a monotonically decreasing
sequence when $c_{ij} < 0$ for all $(i,j) \in E$. If all of the computation trees remain positive
definite in the limit, then the beliefs will all be positive definite upon convergence. If
the estimates for the means converge as well, then the converged beliefs must be locally
decodable to the correct minimizing assignment. Notice that none of the above arguments
for the variances require $\Gamma$ to be positive definite. Indeed, we have already seen an example
of a matrix with a strictly positive diagonal and negative eigenvalues (see the matrix in
Figure 5) such that variance estimates converge.
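
The behavior described by Lemmas 12, 13, 15, and 16 can be observed directly by iterating the variance update from the zero vector. The sketch below is an illustration under stated assumptions: it uses a single parameter c on every edge, the update $a_{i \to j} = -(\Gamma_{ij}/c)^2 / (\Gamma_{ii} + \sum_{k \in \partial i \setminus j} c\, a_{k \to i} + (c - 1)\, a_{j \to i})$, and the 4 x 4 matrix of Section 4.4 as read from the text. The function names are hypothetical.

```python
def example_matrix(p):
    # The 4 x 4 matrix (2) of Section 4.4, as read from the text (assumption).
    return [[1.0, p, -p, -p],
            [p, 1.0, -p, 0.0],
            [-p, -p, 1.0, -p],
            [-p, 0.0, -p, 1.0]]


def variance_iterates(gamma, c, steps):
    """Iterate the variance update with a^0 = 0 and c_ij = c on every edge."""
    n = len(gamma)
    nbrs = [[k for k in range(n) if k != i and gamma[i][k] != 0.0] for i in range(n)]
    a = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}  # a[(i, j)] stands for a_{i -> j}
    history = [a]
    for _ in range(steps):
        new = {}
        for (i, j) in a:
            denom = (gamma[i][i]
                     + sum(c * a[(k, i)] for k in nbrs[i] if k != j)
                     + (c - 1.0) * a[(j, i)])
            # An invalid update (non-positive denominator) is recorded as -infinity.
            new[(i, j)] = -(gamma[i][j] / c) ** 2 / denom if denom > 0 else float("-inf")
        a = new
        history.append(a)
    return history


def belief_curvatures(gamma, c, a):
    """Gamma_ii + sum_k c * a_{k -> i}: the quadratic coefficient of each belief."""
    n = len(gamma)
    return [gamma[i][i] + sum(c * a[(k, i)] for k in range(n) if (k, i) in a)
            for i in range(n)]


gamma = example_matrix(0.4)
positive_run = variance_iterates(gamma, 2.0, 200)   # c > 1
negative_run = variance_iterates(gamma, -2.0, 200)  # c < 0
```

With c = 2 every entry of `positive_run` decreases monotonically and stays bounded, while with c = -2 the entries of `negative_run` stay nonpositive and each belief curvature stays at or above the corresponding diagonal entry of $\Gamma$.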

4.2. Synchronous Versus Asynchronous Updates 

The synchronous message-passing updates described in Algorithm 1 enforce a particular or- 
dering on the updates performed at each time step. We can construct an asynchronous ver- 
sion of Algorithm 1 by allowing some arbitrary ordering of message updates. The resulting 
asynchronous algorithm is given by Algorithm 2. Because each asynchronous computation 
tree is a principal submatrix of a synchronous computation tree and principal submatrices 
of positive definite matrices are positive definite, we can easily check that all of the results
of the previous section extend to this asynchronous algorithm as well. 

Asynchronous algorithms allow for quite a bit more flexibility in the scheduling of mes- 
sage updates, and as we will see experimentally in Section 4.4, asynchronous algorithms can 
have better convergence properties than the corresponding synchronous algorithms. To see 
why this might be the case, we will again exploit the properties of graph covers. Specifically, 
we will show that these two algorithms are related via a special 2-cover of the base factor 
graph.

Figure 7: The Kronecker double cover (b) of a pairwise factor graph (a). Panels: (a) a
pairwise factor graph $G$; (b) the bipartite 2-cover of $G$. The node labeled
$i \in G$ corresponds to the variable node $x_i$.



Algorithm 3 Bipartite Asynchronous Algorithm
1: Initialize the messages to some finite vector.
2: Iterate the following until convergence: update all of the outgoing messages from nodes
   labeled one to nodes labeled two and then update all of the outgoing messages from
   nodes labeled two to nodes labeled one using the asynchronous update rule:

$$m_{i \to j}(x_j) := \kappa + \min_{x_i} \Big[ \frac{\psi_{ij}(x_i, x_j)}{c_{ij}} + (c_{ij} - 1)\, m_{j \to i}(x_i) + \phi_i(x_i) + \sum_{k \in \partial i \setminus j} c_{ki}\, m_{k \to i}(x_i) \Big]$$



Every pairwise factor graph, $G = (V_G, E_G)$, admits a bipartite 2-cover, $H = (V_G \times
\{1,2\}, E_H)$, called the Kronecker double cover of $G$. We will denote copies of the variable
$x_i$ in this 2-cover as $x_{i_1}$ and $x_{i_2}$. For every edge $(i,j) \in E_G$, $(i_1, j_2)$ and $(i_2, j_1)$ belong to
$E_H$. In this way, nodes labeled with a one are only connected to nodes labeled with a two
(see Figure 7). Note that if $G$ is already a bipartite graph, then the Kronecker double cover
of $G$ is simply two disjoint copies of $G$.
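
The construction of the Kronecker double cover can be sketched in a few lines. The code below is illustrative only: nodes of the cover are represented as pairs (i, label) with label 1 or 2, and the function names are hypothetical. It also counts connected components, which makes the remark about bipartite base graphs easy to check.

```python
def kronecker_double_cover(edges):
    """Edges of the bipartite 2-cover: (i,1)-(j,2) and (i,2)-(j,1) for each edge (i,j)."""
    cover = []
    for i, j in edges:
        cover.append(((i, 1), (j, 2)))
        cover.append(((i, 2), (j, 1)))
    return cover


def count_components(edges):
    """Number of connected components among the nodes that appear in `edges`."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, components = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first search from `start`
            node = stack.pop()
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return components


triangle = [(0, 1), (1, 2), (0, 2)]   # not bipartite
path = [(0, 1), (1, 2)]               # bipartite
triangle_cover = kronecker_double_cover(triangle)
path_cover = kronecker_double_cover(path)
```

The double cover of the triangle is a single 6-cycle, whereas the double cover of the path splits into two disjoint copies of the path.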

We can view the synchronous algorithm described in Algorithm 1 as a specific asyn- 
chronous algorithm on the Kronecker double cover where we perform the asynchronous 
update for every variable in the same partition on alternating iterations (see Algorithm 3). 

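By construction, the message vector produced by Algorithm 3 is simply a concatenation
of two consecutive time steps of the synchronous algorithm. Specifically, for all $t \ge 1$,

$$m_H^t = \begin{bmatrix} m_G^{2t-1} \\ m_G^{2t} \end{bmatrix},$$

where $m_H^t$ is the message vector of Algorithm 3 after $t$ iterations, $m_G^s$ is the message vector
of Algorithm 1 at time $s$, and the first block collects the messages sent from nodes labeled one.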

Therefore, the messages passed by Algorithm 1 are identical to those passed by an
asynchronous algorithm on the Kronecker double cover. From our earlier analysis, we know
that even if $\Gamma$ is positive definite, not every cover necessarily corresponds to a convex
objective function. If the Kronecker double cover is such a "bad" cover, then we might
expect that the synchronous reweighted algorithm may not converge to the correct solution.
This reasoning is not unique to iterative message-passing algorithms. In the next section,
we will see that it can be applied to other iterative techniques for quadratic minimization.

Algorithm 4 Jacobi Iteration
1: Choose an initial vector $x^0 \in \mathbb{R}^n$.
2: For iteration $t = 1, 2, \ldots$ set

$$x_j^t = \frac{h_j - \sum_{k \ne j} \Gamma_{jk} x_k^{t-1}}{\Gamma_{jj}}$$

   for each $j \in \{1, \ldots, n\}$.

Algorithm 5 Gauss-Seidel Iteration
1: Choose an initial vector $x \in \mathbb{R}^n$.
2: Choose some ordering of the variables, and perform the following update for each
   variable $j$, in order:

$$x_j = \frac{h_j - \sum_{k \ne j} \Gamma_{jk} x_k}{\Gamma_{jj}}$$

4.2.1. The Gauss-Seidel and Jacobi Methods

Because minimizing symmetric positive definite quadratic functions is equivalent to
solving symmetric positive definite linear systems, well-studied algorithms such as Gaussian
elimination, Cholesky decomposition, etc. can be used to compute the minimum. In
addition, many iterative algorithms have been proposed to solve the linear system $\Gamma x = h$:
Gauss-Seidel iteration, Jacobi iteration, the algebraic reconstruction technique, etc.

In this section, we will show that the previous graph cover analysis can also be used
to reason about the Jacobi and Gauss-Seidel algorithms (Algorithms 4 and 5). When $\Gamma$ is
symmetric positive definite, the objective function, $\frac{1}{2} x^T \Gamma x - h^T x$, is a convex function of
$x$. Consequently, we could use a coordinate descent scheme in an attempt to minimize the
objective function. The standard cyclic coordinate descent algorithm for this problem is
known as the Gauss-Seidel algorithm.

In the same way that Algorithm 1 is a synchronous version of Algorithm 2, the Jacobi
algorithm is a synchronous version of the Gauss-Seidel algorithm. To see this, observe that
the iterates produced by the Jacobi algorithm are related to the iterates of the Gauss-Seidel
algorithm on a larger problem. Specifically, given a symmetric $\Gamma \in \mathbb{R}^{n \times n}$ and $h \in \mathbb{R}^n$,
construct $\Gamma' \in \mathbb{R}^{2n \times 2n}$ and $h' \in \mathbb{R}^{2n}$ as follows:

$$\Gamma' = \begin{bmatrix} D & M \\ M & D \end{bmatrix}, \qquad h' = \begin{bmatrix} h \\ h \end{bmatrix},$$

where $D$ is a diagonal matrix with the same diagonal entries as $\Gamma$ and $M = \Gamma - D$.

$\Gamma'$ is the analog of the Kronecker double cover discussed in Section 4.2. Let $x^0 \in \mathbb{R}^n$
be an initial vector for the Jacobi algorithm performed on the matrix $\Gamma$ and fix $y^0 \in \mathbb{R}^{2n}$
such that $y^0_i = x^0_{i \bmod n}$, so that both halves of $y^0$ are copies of $x^0$. Further, suppose that we
update the variables in the order $1, 2, \ldots, 2n$ in the Gauss-Seidel algorithm. If $y^t$ is the vector
produced after $t$ complete cycles of the Gauss-Seidel algorithm, then

$$y^t = \begin{bmatrix} x^{2t-1} \\ x^{2t} \end{bmatrix}.$$

Also, observe that, for any $y = [y_1; y_2]$ with $y_1, y_2 \in \mathbb{R}^n$ such that $\Gamma' y = h'$, we must have that

$$\Gamma \cdot \frac{y_1 + y_2}{2} = h.$$

With these two observations, any convergence result for the Gauss-Seidel algorithm can
be extended to the Jacobi algorithm. Consider the following:

Theorem 17 Let $\Gamma$ be a symmetric positive semidefinite matrix with a strictly positive
diagonal. The Gauss-Seidel algorithm converges to a vector $x^*$ such that $\Gamma x^* = h$ whenever
such a vector exists.

Proof See Section 10.5.1 of Byrne (2008). ■

Using our observations, we can immediately produce the following new result:

Corollary 18 Let $\Gamma$ be a symmetric positive semidefinite matrix with positive diagonal and
let $\Gamma'$ be constructed as above. If $\Gamma'$ is a symmetric positive semidefinite matrix and there
exists an $x^*$ such that $\Gamma x^* = h$, then the sequence $\frac{x^t + x^{t-1}}{2}$ converges to $x^*$ where $x^t$ is the
$t$-th iterate of the Jacobi algorithm.

If $\Gamma'$ is not positive semidefinite, then the Gauss-Seidel algorithm (and by extension the
Jacobi algorithm) may or may not converge when run on $\Gamma'$.
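
The relationship between the two iterations is easy to verify numerically. The following sketch, with hypothetical function names and test matrices chosen only for illustration, builds the doubled system from a matrix and right-hand side, runs Gauss-Seidel cycles on it, and compares the result with pairs of consecutive Jacobi iterates. It also includes a 2 x 2 example in which the Jacobi iterates oscillate while their pairwise averages solve the system, in the spirit of Corollary 18.

```python
def jacobi_step(gamma, h, x):
    """One Jacobi sweep: every coordinate is updated from the previous iterate."""
    n = len(h)
    return [(h[j] - sum(gamma[j][k] * x[k] for k in range(n) if k != j)) / gamma[j][j]
            for j in range(n)]


def gauss_seidel_cycle(gamma, h, y):
    """One Gauss-Seidel cycle: coordinates are updated in place, in index order."""
    y = list(y)
    n = len(h)
    for j in range(n):
        y[j] = (h[j] - sum(gamma[j][k] * y[k] for k in range(n) if k != j)) / gamma[j][j]
    return y


def doubled_system(gamma, h):
    """Build [[D, M], [M, D]] and [h; h], with D the diagonal of gamma and M = gamma - D."""
    n = len(h)
    big = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                big[i][i] = big[n + i][n + i] = gamma[i][i]   # D blocks
            else:
                big[i][n + j] = big[n + i][j] = gamma[i][j]   # M blocks
    return big, h + h


gamma = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
h = [1.0, 2.0, 3.0]
x0 = [0.5, -1.0, 2.0]
big, big_h = doubled_system(gamma, h)

jacobi_iterates = [x0]
for _ in range(6):
    jacobi_iterates.append(jacobi_step(gamma, h, jacobi_iterates[-1]))

gs_iterates = [x0 + x0]
for _ in range(3):
    gs_iterates.append(gauss_seidel_cycle(big, big_h, gs_iterates[-1]))

# A singular example: Jacobi oscillates, but the average of consecutive iterates is a solution.
singular = [[1.0, 1.0], [1.0, 1.0]]
s1 = jacobi_step(singular, [1.0, 1.0], [0.0, 0.0])
s2 = jacobi_step(singular, [1.0, 1.0], s1)
average = [(u + v) / 2.0 for u, v in zip(s1, s2)]
```

After t Gauss-Seidel cycles on the doubled system, the two halves of the iterate agree with the Jacobi iterates 2t-1 and 2t. In the singular example the Jacobi iterates alternate between (1, 1) and (0, 0), and their average (0.5, 0.5) satisfies the linear system.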

4.3. Convergence of the Means 

If the variances converge, then the fixed points of the message updates for the means
correspond to the solution of a particular linear system $Mb = d$. In fact, we can show that
Algorithm 2 is exactly the Gauss-Seidel algorithm for this linear system. First, we construct
the matrix $M \in \mathbb{R}^{2|E| \times 2|E|}$:

$$M_{ij,ij} = A^*_{i \setminus j} \quad \text{for all } i \in V \text{ and } j \in \partial i,$$

$$M_{ij,ki} = c_{ki} \frac{\Gamma_{ij}}{c_{ij}} \quad \text{for all } i \in V \text{ and for all } j, k \in \partial i \text{ such that } k \ne j,$$

$$M_{ij,ji} = (c_{ij} - 1) \frac{\Gamma_{ij}}{c_{ij}} \quad \text{for all } i \in V \text{ and } j \in \partial i.$$

Here, $A^*$ is constructed from the vector of converged variances, $a^*$. All other entries of the
matrix are equal to zero. Next, we define the vector $d \in \mathbb{R}^{2|E|}$ by setting $d_{ij} = h_i \Gamma_{ij}/c_{ij}$ for
all $i \in V$ and $j \in \partial i$.

By definition, any fixed point, $b^*$, of the message update equations for the means must
satisfy $Mb^* = d$. With these definitions, Algorithm 2 is precisely the Gauss-Seidel algorithm
for this matrix. Similarly, Algorithm 1 corresponds to the Jacobi algorithm. Unfortunately,
$M$ is neither symmetric nor diagonally dominant, so the standard results for the convergence
of the Gauss-Seidel algorithm do not necessarily apply to this situation. In practice, we have
observed that the asynchronous reweighted message-passing algorithm converges if each $c_{ij}$
is sufficiently large (or sufficiently negative).

Figure 8: The error, measured by the 2-norm, between the current mean estimate and the
true mean at each step of the min-sum algorithm, the asynchronous algorithm
with $c_{ij} = 2$ for all $i \ne j$, and the synchronous algorithm with $c_{ij} = 2$ for all $i \ne j$
for the matrix in (2). Notice that all of the algorithms have a similar performance
when $p$ is chosen such that the matrix is scaled diagonally dominant. When the
matrix is not scaled diagonally dominant, the min-sum algorithm converges more
slowly or does not converge at all.

4.4. Experimental Results

Even simple experiments demonstrate the advantages of the reweighted message-passing
algorithm compared to the typical min-sum algorithm. Throughout this section, we will
assume that $h$ is chosen to be the vector of all ones. Let $\Gamma$ be the following matrix.

$$\Gamma = \begin{pmatrix} 1 & p & -p & -p \\ p & 1 & -p & 0 \\ -p & -p & 1 & -p \\ -p & 0 & -p & 1 \end{pmatrix} \qquad (2)$$



The standard min-sum algorithm converges to the correct solution for $0 < p < .39865$
(Malioutov et al., 2006). Figure 8 illustrates the behavior of the min-sum algorithm, the
asynchronous reweighted message-passing algorithm with $c_{ij} = 2$ for all $i \ne j$, and the
synchronous algorithm with $c_{ij} = 2$ for all $i \ne j$ for different choices of the constant $p$.
Each iteration of the asynchronous algorithm consists of cyclically updating all messages.
In the examples in Figure 8, the synchronous and asynchronous algorithms always converge
rapidly to the correct mean while the min-sum algorithm converges slowly or not at all as
$p$ approaches .5.
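
For readers who want to experiment with these settings, the following is a minimal sketch of the synchronous reweighted updates specialized to quadratic messages. It is not the authors' implementation. It assumes that each message is parameterized as $m_{i \to j}(x_j) = \frac{1}{2} a_{i \to j} x_j^2 + b_{i \to j} x_j$ with zero initialization, that the same parameter c is used on every edge, and that the estimate for $x_i$ is the minimizer of the belief, $(h_i - \sum_k c\, b_{k \to i}) / (\Gamma_{ii} + \sum_k c\, a_{k \to i})$. The function names are hypothetical, and c = 1 corresponds to the standard min-sum algorithm.

```python
def example_matrix(p):
    # The 4 x 4 matrix (2), as read from the text (assumption).
    return [[1.0, p, -p, -p],
            [p, 1.0, -p, 0.0],
            [-p, -p, 1.0, -p],
            [-p, 0.0, -p, 1.0]]


def reweighted_min_sum(gamma, h, c, iters):
    """Synchronous reweighted updates for quadratic messages; returns the mean estimates."""
    n = len(h)
    nbrs = [[k for k in range(n) if k != i and gamma[i][k] != 0.0] for i in range(n)]
    a = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}  # quadratic coefficients
    b = dict(a)                                           # linear coefficients
    for _ in range(iters):
        new_a, new_b = {}, {}
        for (i, j) in a:
            denom = (gamma[i][i]
                     + sum(c * a[(k, i)] for k in nbrs[i] if k != j)
                     + (c - 1.0) * a[(j, i)])
            lin = (h[i]
                   - sum(c * b[(k, i)] for k in nbrs[i] if k != j)
                   - (c - 1.0) * b[(j, i)])
            new_a[(i, j)] = -(gamma[i][j] / c) ** 2 / denom
            new_b[(i, j)] = lin * (gamma[i][j] / c) / denom
        a, b = new_a, new_b
    return [(h[i] - sum(c * b[(k, i)] for k in nbrs[i]))
            / (gamma[i][i] + sum(c * a[(k, i)] for k in nbrs[i]))
            for i in range(n)]


def residual(gamma, h, x):
    """Largest absolute entry of gamma * x - h."""
    return max(abs(sum(g * v for g, v in zip(row, x)) - hi) for row, hi in zip(gamma, h))


ones = [1.0, 1.0, 1.0, 1.0]
min_sum_estimate = reweighted_min_sum(example_matrix(0.3), ones, 1.0, 300)
reweighted_estimate = reweighted_min_sum(example_matrix(0.45), ones, 3.0, 300)
```

Comparing `residual(example_matrix(p), ones, ...)` for different values of p and c gives a quick way to explore the behavior reported in this section; the sketch omits damping and the asynchronous schedule.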

While this is a simple graph, the behavior of the algorithm for different choices of the
vector $c$ is already apparent. If we set $c_{ij} = 3$ for all $i \ne j$, then empirically, both the
synchronous and asynchronous algorithms converge for all $p \in (-.5, .5)$, which is the entire
positive definite region for this matrix. However, different choices of the parameter vector
can greatly increase or decrease the number of iterations required for convergence. Figure
9 illustrates the iterations to convergence for the reweighted algorithms at $p = .4$ versus $c$.

Although both the synchronous and asynchronous algorithms converge for the entire 
positive definite region in the above example, the synchronous and asynchronous algorithms 
can have very different convergence properties and damping may be required in order to 
force the synchronous algorithm to converge over arbitrary graphs, even for sufficiently large 
c. Figure 10 illustrates these convergence issues for the matrix,

[matrix (3); its entries are not recoverable from the extracted text]    (3)

The above matrix was randomly generated. Similar observations can be made for many 
other positive definite matrices as well. 

Figure 9: The number of iterations needed to reduce the error of the mean estimates below
a fixed threshold (a negative power of 10 whose exponent is illegible in the source) using
the reweighted algorithms as a function of $c$ for the matrix in (2) with
$p = .4$. The gap in the plot is predicted by the arguments at the end of Section
3.1.

Figure 10: The number of iterations needed to reduce the error of the mean estimates below
a fixed threshold (a negative power of 10 whose exponent is illegible in the source) using
the reweighted algorithms as a function of $c$ for the matrix in (3).
Again, the gap in the plot is predicted by the arguments at the end of Section
3.1.

5. Conclusions and Future Research

In this work, we explored the properties of reweighted message-passing algorithms for the
quadratic minimization problem. Our motivation was to address the convergence issues in
the GaBP algorithm by leveraging the reweighting. To this end, we employed graph covers
to prove that standard approaches to convergence and correctness that exploit duality and
coordinate ascent/descent such as MPLP (Globerson and Jaakkola, 2007), tree-reweighted
max-product (Wainwright et al., 2003a), and Sontag and Jaakkola (2009) are doomed to
fail outside of walk-summable models. While the GaBP variances may not converge outside
of walk-summable matrices, we showed that there always exists a choice of reweighting
parameters that guarantees monotone convergence of the variances. Empirically, a similar
strategy seems to guarantee convergence of the means as well. As a result, our approach
demonstrably outperforms the GaBP algorithm on this problem. We conclude this work
with a discussion of a few open problems and directions for future research.

5.1. Convergence 

The main open questions surrounding the performance of the reweighted algorithm relate
to questions of convergence. First, for all positive definite $\Gamma$, we conjecture that there exists
a sufficiently large (or sufficiently negative) choice of the parameters such that the means
always converge.

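Second, in practice, one typically uses a damped version of the message updates in order
to attempt to force convergence. For the min-sum algorithm, the damped updates are given by

$$m^t_{i \to j}(x_j) = \kappa + \delta\, m^{t-1}_{i \to j}(x_j) + (1 - \delta) \min_{x_i} \Big[ \psi_{ij}(x_i, x_j) + \phi_i(x_i) + \sum_{k \in \partial i \setminus j} m^{t-1}_{k \to i}(x_i) \Big]$$

for some damping factor $\delta \in [0, 1)$.

The damped min-sum algorithm with damping factor $\delta = 1/2$ empirically seems to
converge if $\Gamma$ is positive definite and all of the computation trees remain positive definite
(Malioutov et al., 2006). We make the same observation for the damped version of
Algorithm 1.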

In practice, the damped synchronous algorithm with $\delta = 1/2$ and the asynchronous
algorithm appear to converge for all sufficiently large choices of the parameter vector as
long as $\Gamma$ is positive definite. We conjecture that this is indeed the case: for all positive
definite $\Gamma$ there exists a $c$ such that if $c_{ij} = c$ for all $i \ne j$, then the asynchronous algorithm
always converges. In this line of exploration, the relationship between the synchronous and
the asynchronous algorithms described in Section 4.2 may be helpful.

Finally, Moallemi and Van Roy (2010) were able to provide rates of convergence in the
case that $\Gamma$ is walk-summable by using a careful analysis of the computation trees.
Perhaps similar ideas could be adapted for the computation trees produced by the reweighted
algorithm.

5.2. General Convex Minimization 

The previous graph cover observations can, in theory, be applied to minimize general convex
functions, but in practice, computing and storing the message vector may be inefficient.
Despite this, many of the previous observations can be extended to any convex function
$f : C \to \mathbb{R}$ such that $C \subseteq \mathbb{R}^n$ is a convex set.

As was the case for quadratic minimization, convexity of the objective function $f^G$ does
not necessarily guarantee convexity of the objective function $f^H$ for every finite cover $H$
of $G$. Recall that the existence of graph covers that are not bounded from below can be
problematic for the reweighted message-passing algorithm. For quadratic functions, this
cannot occur if the matrix is scaled diagonally dominant or, equivalently, if the objective
function corresponding to every finite graph cover is positive definite. This equivalence
suggests a generalization of scaled diagonal dominance for arbitrary convex functions based on
the convexity of their graph covers. Such convex functions would have desirable properties
with respect to iterative message-passing schemes.

Lemma 19 Let $f$ be a convex function that factorizes over a graph $G$. Suppose that for
every finite cover $H$ of $G$, $f^H$ is convex. If $x^G \in \arg\min_x f(x)$, then for every finite cover
$H$ of $G$, $x^H$, the lift of $x^G$ to $H$, minimizes $f^H$.

Proof This follows from the observation that all convex functions are subdifferentiable over
their domains and that $x^H$ is a minimum of $f^H$ if and only if the zero vector is contained
in the subdifferential of $f^H$ at $x^H$. ■

Even if the objective function is not convex for some cover, we may still be able to use
the same trick as in Theorem 14 in order to force the computation trees to be convex. Let
$C \subseteq \mathbb{R}^n$ be a convex set. If $f : C \to \mathbb{R}$ is twice continuously differentiable, then $f$ is convex
if and only if its Hessian, the matrix of second partial derivatives, is positive semidefinite
on the interior of $C$. For each fixed $x \in C$, Theorem 14 demonstrates that there exists a
choice of the vector $c$ such that all of the computation trees are convex at $x$, but it does
not guarantee the existence of a $c$ that is independent of $x$.

For twice continuously differentiable functions, sufficient conditions for the convergence 
of the min-sum algorithm that are based on a generalization of scaled diagonal dominance 
are discussed in Moallemi and Van Roy (2010), and extending the above ideas is the subject 
of future research. 

Appendix A. Proof of Theorem 11 

Without loss of generality, we can assume that $\Gamma$ has a unit diagonal. We break the proof
into several pieces:

• (1 $\Rightarrow$ 2) Without loss of generality we can assume that $|I - \Gamma|$ is irreducible (if not
we can make this argument on each of its connected components). Let $1 > \lambda > 0$ be
an eigenvalue of $|I - \Gamma|$ with eigenvector $x > 0$ whose existence is guaranteed by the
Perron-Frobenius theorem. For any row $i$, we have:

$$x_i > \lambda x_i = \sum_{j \ne i} |\Gamma_{ij}|\, x_j.$$

Since $\Gamma_{ii} = 1$ this is the definition of scaled diagonal dominance with $w = x$.

• (2 $\Rightarrow$ 3) If $\Gamma$ is scaled diagonally dominant then so is every one of its covers. Scaled
diagonal dominance of a symmetric matrix with a positive diagonal implies that the
matrix is symmetric positive definite. Therefore, all covers must be symmetric positive
definite.



• (3 $\Rightarrow$ 4) Trivial.

• (4 $\Rightarrow$ 1) Let $\tilde{\Gamma}$ be any 2-cover of $\Gamma$. Without loss of generality, we can assume that $\tilde{\Gamma}$
has the form (1).

First, observe that by the Perron-Frobenius theorem there exists an eigenvector $x > 0$,
$x \in \mathbb{R}^n$, of $|I - \Gamma|$ with eigenvalue $\rho(|I - \Gamma|)$. Let $y \in \mathbb{R}^{2n}$ be constructed by duplicating
the values of $x$ so that $y_{2i} = y_{2i+1} = x_i$ for each $i \in \{0, \ldots, n-1\}$. By Lemma 10, $y$ is an
eigenvector of $|I - \tilde{\Gamma}|$ with eigenvalue equal to $\rho(|I - \Gamma|)$. We claim that this implies
$\rho(|I - \tilde{\Gamma}|) = \rho(|I - \Gamma|)$. Assume without loss of generality that $|I - \tilde{\Gamma}|$ is irreducible; if
not, then we can apply the following argument to each connected component of $|I - \tilde{\Gamma}|$.
By the Perron-Frobenius theorem again, $|I - \tilde{\Gamma}|$ has a unique positive eigenvector (up
to scalar multiple), with eigenvalue equal to the spectral radius. Thus, $\rho(|I - \tilde{\Gamma}|) =
\rho(|I - \Gamma|)$ because $y > 0$.

We will now construct a specific cover $\tilde{\Gamma}$ such that $\tilde{\Gamma}$ is positive definite if and only
if $\Gamma$ is walk-summable. To do this, we'll choose the $P_{ij}$ as in (1) such that $P_{ij} = I$ if
$\Gamma_{ij} < 0$ and $P_{ij} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ otherwise. Now define $z \in \mathbb{R}^{2n}$ by setting $z_i = (-1)^i c\, y_i$,
where the constant $c$ ensures that $\|z\| = 1$.

Consider the following:

$$z^T \tilde{\Gamma} z = \sum_i \sum_j \Gamma_{ij}\, [z_{2i}, z_{2i+1}]\, P_{ij} \begin{bmatrix} z_{2j} \\ z_{2j+1} \end{bmatrix} = 1 - 2 \sum_{i > j} |\tilde{\Gamma}_{ij}|\, c^2 y_i y_j.$$

Recall that $y$ is the eigenvector of $|I - \tilde{\Gamma}|$ corresponding to the largest eigenvalue and
$\|cy\| = 1$. By definition and the above,

$$\rho(|I - \tilde{\Gamma}|) = \frac{(cy)^T |I - \tilde{\Gamma}|\, (cy)}{c^2 y^T y} = 2 \sum_{i > j} |\tilde{\Gamma}_{ij}|\, c^2 y_i y_j.$$

Combining all of the above we see that $z^T \tilde{\Gamma} z = 1 - \rho(|I - \tilde{\Gamma}|)$. Now, $\tilde{\Gamma}$ positive definite
implies that $z^T \tilde{\Gamma} z > 0$, so $1 - \rho(|I - \Gamma|) > 0$. In other words, $\Gamma$ is walk-summable. ■
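
The key identity of the (4 $\Rightarrow$ 1) step can be checked numerically. The sketch below is an illustration under stated assumptions: the 2-cover is built by replacing each entry $\Gamma_{ij}$ with a 2 x 2 block $\Gamma_{ij} P_{ij}$ (this is how form (1) is interpreted here), indices start at zero, and the test matrix is the 4 x 4 example of Section 4.4 as read from the text with p = 0.45. The function names are hypothetical.

```python
def example_matrix(p):
    # The 4 x 4 matrix (2) of Section 4.4, as read from the text (assumption).
    return [[1.0, p, -p, -p],
            [p, 1.0, -p, 0.0],
            [-p, -p, 1.0, -p],
            [-p, 0.0, -p, 1.0]]


def perron(gamma, iters=500):
    """Power iteration on |I - gamma| (unit diagonal assumed): spectral radius and eigenvector."""
    n = len(gamma)
    r = [[abs(gamma[i][j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    x = [1.0] * n
    rho = 0.0
    for _ in range(iters):
        y = [sum(r[i][j] * x[j] for j in range(n)) for i in range(n)]
        rho = max(y)
        x = [v / rho for v in y]
    return rho, x


def signed_two_cover(gamma):
    """2-cover with P_ij = I for negative off-diagonal entries and the swap otherwise."""
    n = len(gamma)
    cover = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            crossed = i != j and gamma[i][j] >= 0.0
            for s in (0, 1):
                t = 1 - s if crossed else s
                cover[2 * i + s][2 * j + t] = gamma[i][j]
    return cover


gamma = example_matrix(0.45)
rho, x = perron(gamma)
scale = (2.0 * sum(v * v for v in x)) ** -0.5   # makes z a unit vector
z = []
for v in x:
    z += [scale * v, -scale * v]                # z_i = (-1)^i * c * y_i
cover = signed_two_cover(gamma)
value = sum(z[i] * cover[i][j] * z[j] for i in range(len(z)) for j in range(len(z)))
```

Here `value` agrees with `1 - rho` up to rounding. Because the spectral radius exceeds one at p = 0.45, the quadratic form is negative, so this particular 2-cover is not positive definite even though p lies inside the positive definite region (-.5, .5) of the base matrix.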



Appendix B. Proof of Theorem 14 

Let $T_v(t)$ be the depth $t$ computation tree rooted at $v$, and let $\Gamma'$ be the matrix corresponding
to $T_v(t)$ (i.e., the matrix generated by the potentials in the computation tree). We will show
that the eigenvalues of $\Gamma'$ are bounded from below by some $\epsilon > 0$. For any $i \in T_v(t)$ at
depth $d$ define:

where $r$ is as in the statement of the theorem and $s$ is a positive real to be determined below.
Let $W$ be a diagonal matrix whose entries are given by the vector $w$. By the Geršgorin disc
theorem (Horn and Johnson, 1990), all of the eigenvalues of $W^{-1} \Gamma' W$ are contained in

$$\bigcup_{i \in T_v(t)} \Big\{ z \in \mathbb{R} : |z - \Gamma'_{ii}| \le \frac{1}{w_i} \sum_{j \ne i} w_j |\Gamma'_{ij}| \Big\}.$$

Because all of the eigenvalues are contained in these discs, we need to show that there is a
choice of $s$ and $r$ such that for all $i \in T_v(t)$, $|\Gamma'_{ii}| - \frac{1}{w_i} \sum_{j \ne i} w_j |\Gamma'_{ij}| \ge \epsilon$.

ir ■ ■ I 

Recall from Section 2.2.1 that |r^| = for some constant r] that depends on r. 

Further, all potentials below the potential on the edge are multiplied by r/7 for some 
constant 7. We can divide out by this common constant to obtain equations that depend 
on r and the elements of T. Note that some self-potentials will be multiplied by r — 1 while 
others will be multiplied by r. With this rewriting, there are three possibilities: 

Ir ■ 1 

1. i is a leaf of Ty{t). In this case, we need |rjj| > ^^^^^(j)- Plugging in the 
definition of Wi, we have 

\ru\ > (4) 



2. i is not a leaf of Ty(t) or the root. In this case, we need 
1 r\r. ,..\ .2 



Wi L r 



^P{i)^ ^3 \>^ip{i)\Wp{i) + 2^ \Tki\wk 



Again, plugging the definition of Wi into the above yields 

r — 1 



k&di—p{i) 



3. i is the root of Ty{t). Similar to the previous case, we need |rjj|i(;j > X^fceSi l^fcil^fc- 
Again, plugging the definition of wi into the above yields 



k&di 



None of these bounds are time dependent. As such, if we choose $s$ and $r$ to satisfy the
above constraints, then there must exist some $\epsilon > 0$ such that the smallest eigenvalue of any
computation tree is at least $\epsilon$. Fix $s$ to satisfy (4) for all leaves of $T_v(t)$. This implies that

Ir - - I 

!£iiL) > for any i € Ty{t). Finally, we can choose a sufficiently large r that 

satisfies the remaining two cases for all $i \in T_v(t)$.